Air Quality Project Proposal

Visualizing Air Quality of Countries Over Time

Authors

Ng Wei Herng 2302854

Timothy Zoe 2302663

Agcanas Clarence Angelo Misagal 2302823

Yeo Song Chen 2302785

Lee Ru Yuan 2303393

Published

July 7, 2024

These libraries will be required:

library(tidyverse)
Warning: package 'dplyr' was built under R version 4.4.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
Warning: package 'readxl' was built under R version 4.4.1
library(readr)
library(dplyr)
library(tidyr)

1 Original Data Visualization in News Media

The visualization titled “Comparing PM2.5 Concentrations in Capital Cities” created by Pallavi Rao (2023) and published on “The Visual Capitalist”, presents a snapshot of PM2.5 air pollution levels in various capital cities around the world for the year 2022. PM2.5 refers to particulate matter that is less than 2.5 micrometers in diameter, which is small enough to penetrate the lungs and enter the bloodstream, posing significant health risks.

This visualization uses a series of red circles to represent the PM2.5 concentrations in each capital city. The number of circles corresponds to the level of PM2.5 concentration that exceeds the World Health Organization’s (WHO) safe limit for PM2.5, which is 5 µg/m³. Any value above this indicates a higher risk for adverse health effects.

Figure 1: Visualized: Air Quality and Pollution in 50 Capital Cities (IQAir 2022 World Air Quality Report)

Figure 1: Visualized: Air Quality and Pollution in 50 Capital Cities (IQAir 2022 World Air Quality Report)

2 Critical Assessment of the Original Visualization

The original visualization made good use of color as the choice of red for the circles emphasizes the health hazard posed by high PM2.5 concentrations. It provided a clear legend and chose a consistent format for its presentation, which made sense for its purpose as a comparative visualization. The data was presented in a ranked manner, making it easier to follow the trends in air quality across different cities.

However, the visualization is overall cluttered and does not scale well, making it tedious when trying to compare or see the trend between a larger number of cities. There are several areas where the visualization could be improved to enhance its effectiveness and user experience:

  1. Overcrowding in High PM2.5 Cities: For cities with extremely high PM2.5 levels, the visualization becomes overcrowded with red circles, making it difficult to discern individual data points while also diminishing the impact of the difference in scale between such cities.
  2. Static Year Selection: The visualization is limited to 2022 data. Including multiple years would provide a more dynamic and comprehensive view of trends over time.
  3. No Regional Differentiation: While each city is labeled, there is no clear regional differentiation which could be useful in understanding broader regional trends and patterns.
  4. Lack of Depth: There are no interactive elements like info-tips or toggle buttons that allow users to explore the data in more depth or switch views between different time periods or concentration ranges.
  5. No Contextual Information: The visualization lacks additional context such as population density, industrial activity, or geographical features that could help explain the varying air quality levels in each city.

3 Proposed Improvements

  1. Alternative Visualization Type: An alternative visual representations such as a color gradient or bar chart could prevent overcrowding and improve clarity.
  2. Expansion on Available Data: Including options to view historical data from before 2022 will help in trend studies regarding the past history of air quality for the country or region.
  3. Geographical Grouping: Grouping the cities by region or continent could make the data more accessible and relevant for viewers interested in specific areas.
  4. Interactive Features: Instead of looking through the whole list of countries or regions to find a specific data point, a filter will allow users to zoom in on specific cities, view time-series data, and explore the sources and effects of pollution.
  5. Inclusion of Contextual Information: Including other relevant data such as population or growth domestic product (GDP) alongside PM2.5 levels could offer insights on the potential cause or impact when air quality rises or dips.
  6. Improved color coding: The color representation of the data can be improved by adding different colors and their corresponding gradients to allow for more distinct data representation (e.g., using green to signify good air quality, yellow to signify moderate air quality and red to signify bad air quality).

4 Data Cleaning

4.1 Data Source Summary

The original data set used for the visualization was sourced from the IQAir World Air Quality Report (2022). While IQAir provides an API for users to obtain their air quality data, the API only gives real-time data instead of data in a time series ranging from the past to recent years.

As such, the data set we have chosen to use comes from the World Health Organization (WHO) which provides data on air quality for various countries (2022). The data set contains information on PM2.5 concentrations for different countries and years and is also more precise as WHO has a 60% inclusion requirement whereby the recorded data require annual data availability of at least 60% of the total number of hours in a year to be included. Alongside this data set, we have also chosen to use 3 additional data set for the purposes of enhancing the visualization, as well as to improve on the data engineering and data cleaning aspect of the WHO data set.

The additional data sets are:

  1. Country Codes (2024): This data set contains the alpha-3 country codes and sub-region category for each country, which will be used to merge with the WHO data set so that we may obtain a more accurate mean for missing PM2.5 values based on sub-region.
  2. Population Data (2024): This data set contains the population for each country, which will be used to provide context to the PM2.5 data within our visualization.
  3. GDP Data (2024): This data set contains the GDP for each country, which will be used to provide context to the PM2.5 data within our visualization.

Below is a glimpse and summary of the WHO air quality data set.

who_data <- read_excel("who_aap_2021_v9_11august2022.xlsx", sheet = "AAP_2022_city_v9")

regional_data <- read.csv("all.csv")
glimpse(who_data)
Rows: 32,191
Columns: 15
$ `WHO Region`                             <chr> "Eastern Mediterranean Region…
$ ISO3                                     <chr> "AFG", "ALB", "ALB", "ALB", "…
$ `WHO Country Name`                       <chr> "Afghanistan", "Albania", "Al…
$ `City or Locality`                       <chr> "Kabul", "Durres", "Durres", …
$ `Measurement Year`                       <dbl> 2019, 2015, 2016, 2015, 2016,…
$ `PM2.5 (μg/m3)`                          <dbl> 119.77, NA, 14.32, NA, NA, NA…
$ `PM10 (μg/m3)`                           <dbl> NA, 17.65, 24.56, NA, NA, NA,…
$ `NO2 (μg/m3)`                            <dbl> NA, 26.63, 24.78, 23.96, 26.2…
$ `PM25 temporal coverage (%)`             <dbl> 18, NA, NA, NA, NA, NA, NA, N…
$ `PM10 temporal coverage (%)`             <dbl> NA, NA, NA, NA, NA, NA, NA, N…
$ `NO2 temporal coverage (%)`              <dbl> NA, 83.96119, 87.93260, 97.85…
$ Reference                                <chr> "U.S. Department of State, Un…
$ `Number and type of monitoring stations` <chr> "NA", "NA", "NA", "NA", "NA",…
$ `Version of the database`                <dbl> 2022, 2022, 2022, 2022, 2022,…
$ Status                                   <lgl> NA, NA, NA, NA, NA, NA, NA, N…
summary(who_data)
  WHO Region            ISO3           WHO Country Name   City or Locality  
 Length:32191       Length:32191       Length:32191       Length:32191      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 Measurement Year PM2.5 (μg/m3)     PM10 (μg/m3)     NO2 (μg/m3)    
 Min.   :2000     Min.   :  0.01   Min.   :  1.04   Min.   :  0.00  
 1st Qu.:2014     1st Qu.: 10.35   1st Qu.: 16.98   1st Qu.: 12.00  
 Median :2016     Median : 16.00   Median : 22.00   Median : 18.80  
 Mean   :2016     Mean   : 22.92   Mean   : 30.53   Mean   : 20.62  
 3rd Qu.:2018     3rd Qu.: 31.00   3rd Qu.: 31.30   3rd Qu.: 27.16  
 Max.   :2021     Max.   :191.90   Max.   :540.00   Max.   :210.68  
                  NA's   :17143    NA's   :11082    NA's   :9991    
 PM25 temporal coverage (%) PM10 temporal coverage (%)
 Min.   :  0.00             Min.   :  2.568           
 1st Qu.: 88.60             1st Qu.: 87.945           
 Median : 97.00             Median : 96.039           
 Mean   : 90.79             Mean   : 90.583           
 3rd Qu.: 99.00             3rd Qu.: 98.938           
 Max.   :100.00             Max.   :100.000           
 NA's   :24916              NA's   :26810             
 NO2 temporal coverage (%)  Reference        
 Min.   :  1.923           Length:32191      
 1st Qu.: 93.208           Class :character  
 Median : 96.370           Mode  :character  
 Mean   : 93.697                             
 3rd Qu.: 98.927                             
 Max.   :100.000                             
 NA's   :12301                               
 Number and type of monitoring stations Version of the database  Status       
 Length:32191                           Min.   :2016            Mode:logical  
 Class :character                       1st Qu.:2022            NA's:32191    
 Mode  :character                       Median :2022                          
                                        Mean   :2022                          
                                        3rd Qu.:2022                          
                                        Max.   :2022                          
                                                                              

4.2 Handling of Missing Values

Based on the above summaries, we can see that there are missing values in the data set in the PM columns. We will need to handle these missing values before proceeding with the changes. Some methods we can use to handle missing values include:

  1. Dropping Missing Values: We can drop rows with missing values if they are not significant in number.
  2. Imputation: We can impute missing values with the mean, median, or mode of the column.

We will impute missing values by merging the WHO data set with the country codes data set to obtain the sub-region category for each country. We will then calculate the mean PM2.5 concentration for each sub-region and impute the missing values with the mean PM2.5 concentration of the corresponding sub-region.

# Merge the data
merged_data <- who_data %>%
  left_join(regional_data %>% select(alpha.3, sub.region), by = c("ISO3" = "alpha.3")) %>%
  select(1:6, sub.region)

head(merged_data)
# A tibble: 6 × 7
  `WHO Region`    ISO3  `WHO Country Name` `City or Locality` `Measurement Year`
  <chr>           <chr> <chr>              <chr>                           <dbl>
1 Eastern Medite… AFG   Afghanistan        Kabul                            2019
2 European Region ALB   Albania            Durres                           2015
3 European Region ALB   Albania            Durres                           2016
4 European Region ALB   Albania            Elbasan                          2015
5 European Region ALB   Albania            Elbasan                          2016
6 European Region ALB   Albania            Elbasan                          2017
# ℹ 2 more variables: `PM2.5 (μg/m3)` <dbl>, sub.region <chr>
# Function to impute missing PM2.5 values
impute_pm25 <- function(data) {
  data %>%
    group_by(sub.region, `Measurement Year`) %>%
    mutate(pm25_mean = mean(`PM2.5 (μg/m3)`, na.rm = TRUE)) %>%
    mutate(`PM2.5 (μg/m3)` = ifelse(is.na(`PM2.5 (μg/m3)`), pm25_mean, `PM2.5 (μg/m3)`)) %>%
    select(-pm25_mean)
}

imputed_data <- impute_pm25(merged_data)

head(imputed_data)
# A tibble: 6 × 7
# Groups:   sub.region, Measurement Year [4]
  `WHO Region`    ISO3  `WHO Country Name` `City or Locality` `Measurement Year`
  <chr>           <chr> <chr>              <chr>                           <dbl>
1 Eastern Medite… AFG   Afghanistan        Kabul                            2019
2 European Region ALB   Albania            Durres                           2015
3 European Region ALB   Albania            Durres                           2016
4 European Region ALB   Albania            Elbasan                          2015
5 European Region ALB   Albania            Elbasan                          2016
6 European Region ALB   Albania            Elbasan                          2017
# ℹ 2 more variables: `PM2.5 (μg/m3)` <dbl>, sub.region <chr>

4.3 Normalizing Column Names

We will also need to normalize the column names to ensure consistency and ease of access. This will involve converting all column names to lowercase, replacing spaces with underscores, and removing special characters.

# Standardize column names
colnames(imputed_data) <- tolower(c("WHO_Region", "ISO3", "WHO_Country_Name", "City_or_Locality", 
                                    "Measurement_Year", "PM2_5", "sub_region"))

head(imputed_data)
# A tibble: 6 × 7
# Groups:   sub_region, measurement_year [4]
  who_region      iso3  who_country_name city_or_locality measurement_year pm2_5
  <chr>           <chr> <chr>            <chr>                       <dbl> <dbl>
1 Eastern Medite… AFG   Afghanistan      Kabul                        2019 120. 
2 European Region ALB   Albania          Durres                       2015  15.5
3 European Region ALB   Albania          Durres                       2016  14.3
4 European Region ALB   Albania          Elbasan                      2015  15.5
5 European Region ALB   Albania          Elbasan                      2016  14.2
6 European Region ALB   Albania          Elbasan                      2017  15.2
# ℹ 1 more variable: sub_region <chr>

4.4 Data Type Conversion

We will convert the data types of the columns to their appropriate types to be safe.

# Convert data types of columns
imputed_data <- imputed_data %>%
  mutate(
    who_region = as.factor(who_region),
    iso3 = as.factor(iso3),
    who_country_name = as.factor(who_country_name),
    city_or_locality = as.factor(city_or_locality),
    measurement_year = as.numeric(measurement_year),
    pm2_5 = as.numeric(pm2_5),
    sub_region = as.factor(sub_region)
  )

4.5 Removing Duplicates

We will check for and remove any duplicate rows in the dataset to ensure data integrity.

initial_row_count <- nrow(imputed_data)

imputed_data <- imputed_data %>%
distinct()

# Check if any duplicates were removed
final_row_count <- nrow(imputed_data)
duplicates_removed <- initial_row_count - final_row_count

if (duplicates_removed > 0) {
  message(duplicates_removed, " duplicate rows were found and removed.")
} else {
  message("No duplicate rows were found.")
}
141 duplicate rows were found and removed.

4.6 Population Data & GDP Data

The steps to clean the GDP data is the same as the steps to clean the population data.

To clean the population data, we need to remove the metadata rows, reshape the data to have their respective year and population column so that we can merge it with the WHO data set.

# Read in the dataset and skip the first 4 rows
population_data <- read_csv("API_SP.POP.TOTL_DS2_en_csv_v2_580248.csv", skip = 4)
New names:
Rows: 266 Columns: 69
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): Country Name, Country Code, Indicator Name, Indicator Code dbl (64): 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ... lgl (1): ...69
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...69`
# Gather the year columns into key-value pairs of year and population
population_data <- population_data %>%
  gather(key = "year", value = "population", -`Country Name`, -`Country Code`, -`Indicator Name`, -`Indicator Code`) %>%
  mutate(year = ifelse(grepl("^\\d+$", year), as.integer(year), NA_integer_))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `year = ifelse(grepl("^\\d+$", year), as.integer(year),
  NA_integer_)`.
Caused by warning in `ifelse()`:
! NAs introduced by coercion
# Rename columns for clarity
colnames(population_data) <- c("country_name", "country_code", "indicator_name", "indicator_code", "year", "population")

write_csv(population_data, "cleaned_population_data.csv")

head(population_data)
# A tibble: 6 × 6
  country_name       country_code indicator_name indicator_code  year population
  <chr>              <chr>        <chr>          <chr>          <int>      <dbl>
1 Aruba              ABW          Population, t… SP.POP.TOTL     1960      54608
2 Africa Eastern an… AFE          Population, t… SP.POP.TOTL     1960  130692579
3 Afghanistan        AFG          Population, t… SP.POP.TOTL     1960    8622466
4 Africa Western an… AFW          Population, t… SP.POP.TOTL     1960   97256290
5 Angola             AGO          Population, t… SP.POP.TOTL     1960    5357195
6 Albania            ALB          Population, t… SP.POP.TOTL     1960    1608800

4.7 Cleaned Data

This is the cleaned data

write_csv(imputed_data, "cleaned_imputed_data.csv")

head(imputed_data)
# A tibble: 6 × 7
# Groups:   sub_region, measurement_year [4]
  who_region      iso3  who_country_name city_or_locality measurement_year pm2_5
  <fct>           <fct> <fct>            <fct>                       <dbl> <dbl>
1 Eastern Medite… AFG   Afghanistan      Kabul                        2019 120. 
2 European Region ALB   Albania          Durres                       2015  15.5
3 European Region ALB   Albania          Durres                       2016  14.3
4 European Region ALB   Albania          Elbasan                      2015  15.5
5 European Region ALB   Albania          Elbasan                      2016  14.2
6 European Region ALB   Albania          Elbasan                      2017  15.2
# ℹ 1 more variable: sub_region <fct>

5 Conclusion

The data has now been cleaned and is ready for visualization, we will be using ggplot2 to create the visualizations and ggplotly to render the plot interactive. The proposed improvements will be implemented to enhance the clarity and depth of the visualization, providing a more interactive and informative experience for users. By incorporating these enhancements, we aim to create a more engaging and insightful visualization that effectively communicates the trends in air quality across countries globally.

6 References

  1. Rao, P. (2024, January 6). Visualized: Air quality and pollution in 50 capital cities. Visual Capitalist
  2. Air quality database 2022. (2024, June 20)
  3. ISO-3166-Countries-with-Regional-Codes. (2024, June 19)
  4. World Bank Group Population Data. (2024)
  5. World Bank Group GDP Data. (2024)